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Nikita Glukhov 


Senior developer @Postgres Professional 
PostgreSQL contributor 


Major CORE contributions: 
Jsonb improvements 
SQL/JSON (Jsonpath) 
KNN SP-GiST 
Opclass parameters 

Current development: 
SQL/JSON functions 
Jsonb performance 
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° 


NOSQL POSTGRES IN SHORT 


sf f f f 


SQL/JSON — 202? 
° Complete SQL/JSON 
° Better indexing, syntax 


JSONPATH - 2019 
° SQL/JSON — 2016 
° Indexing 


Key-Value Ordered Key-Value Big Table Document, 
Full-Text Search 


oe mu 


JSONB - 2014 
° Binary storage 


EE H—H 
Lz | * Nesting objects & arrays 
| ° Indexing 
` CCE] i 
JSON - 2012 


e Textual storage 
° JSON validation 


HSTORE - 2003 
° Perl-like hash storage 
e No nesting, no arrays 


anggang Postgres Pro 


Postgres revolution: embracing relational databases 
° NoSQL users attracted by the NoSQL Postgres features 


Relative Growth 


= Pg = MySQL = Oracle = MS SQL 


x Dec 18, 2014 
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,. y Json in 
À Z PostgreSQL 


(state of Art) 


2012 


Two JSON data types !!! 


Jsonb vs Json 


SELECT j::json AS json, j: redis AS jsonb FROM 
(SELECT ‘'{"cc":0, "aa": 2, a":1,"b":1)' AS J) AS foo; 
json | jsonb 


("cc":0, "aa": 2, "aa":1,"b":1) | {"b": 1, "aa": 1, "cc": 0} 
e json: textual storage «as is», parsed many 
* jsonb: binary storage, parsed once, great performance (indexing) 
* jsonb: no whitespaces, no duplicated keys (last key win) 
* jsonb: keys are sorted by (length, key) 
° jsonb: a rich set of functions (Ndf jsonb*), "arrow" operators, FTS 
° JsQuery ext. -json query language with GIN indexing support 
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SQL/Foundation recognized JSON after the success of Postgres 


SQL:2016 — 22 JSON features out of 44 new optional. December of 2016 
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SQL/JSON in SQL-201ó 


RFC 1759 


° SQL/JSON data model 


° A sequence of SQL/JSON items, each item can be (recursively) any ot: 


° SQL/JSON scalar — non-null value of SQL types: Unicode character string, numeric, Boolean 
or datetime 


° SQL/JSON null, value that is distinct from any value of any SQL type (not the same as NULL) 
° SQL/JSON arrays, ordered list of zero or more SQL/JSON items — SQL/JSON elements 
° SQL/JSON objects — unordered collections of zero or more SQL/JSON members 
(key, SQL/JSON item) 
° JSON Path language 
° Describes a <projection> of JSON data to be used by SQL/JSON functions 


e SQL/JSON functions (9) 
° Construction functions: values of SQL types to JSON values 


° Query functions: JSON values to SQL types 


JSON Path(JSON values) — SQL/JSON types -> converted to SQL types = 
| ) q xP NY A Pro 


SQL/JSON in PostgreSQL 


RFC 1759 


* SQL/JSON data model 
* Jsonb is the (practical) subset of SQL/JSON data model 
ORDERED and UNIQUE KEYS 


* JSON Path language 

* Describes a «projection» of JSON data (to be used by SQL/JSON functions) 

* Most important part of SQL/JSON - committed to PG12, PG13 (15/15 features) ! 
* SQL/JSON functions - waiting for review (v55, v48) 

° Constructor functions: json[b] construction functions 

* Query functions: functions/operators with jsonpath support 


° Indexes 
* Use already existing indexes (built-in, jsquery) 


Added jsonpath support "8? — €: 


JSONB Projects: What we were working on 


° SQL/JSON functions (SQL-2016) 
° Generic JSON API. Jsonb as a SQL Standard JSON data type. 


° Better jsonb indexing (Jsquery GIN opclasses) 

* Parameters for jsonb operators (planner support functions for Jsonb) 
° JSONB selective indexing (Jsonpath as parameter for jsonb opclasses) 
° Jsonpath syntax extension 

° Simple Dot-Notation Access to JSON Data 
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Current TOP-priority project 


° JSONB - 1st-class citizen in Postgres 
° Efficient storage,select, update, API 
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JSONB Popularity - CREATE TABLE qq (js JSONB) 


State of PostgreSQL 2021 (Survey) Pgsql telegram (6170) — 26.02.2021 


Top 3 features used to organize 
and access data in production 


TRE e SELECT 8061/312083 
o e ou * SQL 4473/144789 


production apps. o JSON[B] 3116/88234 


View full question 


° TABLE 2997/129936 
* JOIN 2345/108860 
° INDEX 1519/74327 
e BACKUP 1484/42618 
e VACUUM 1470/53919 
e REPLICA 707/31036 
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Top-priority: JSONB - 1st-class citizen in Postgres 


e Not enough resources in community (developers, reviewers, committers) 
° SQL/JSON -- 4 years, 55 versions 
° JSON/Table — 48 versions 


° Startups use Postgres and don't care about compatibilty to Oracle/MS SQL 
° Jsonpath is important and committed, SQL/JSON functions are not important 
° Popularity of JSONB — it's mature data type, rich functionality 
° There is a lot to improve in JSONB 
We concentrated on efficient storage, select, update 


° Extendability of JSONB format 
° Extendability of TOAST — data type aware TOAST, TOAST for non-atomic attributes 


n res Pro 


Reality: Unpredictable performance of jsonb 


CREATE TABLE test (jb jsonb); 


ALTER TABLE test ALTER COLUMN jb SET STORAGE EXTERNAL; 
INSERT INTO test 


Small update cause 10 times slowdown ! SELECT 


jsonb build object( 
‘id’, i, 
'foo', (select jsonb agg(0) from generate series(1, 1960/12)) -- [0,0,0, ...] 
=# EXPLAIN(ANALYZE, BUFFERS) SELECT jb-»'id' FROM test; ciem 
QUERY PLAN generate series(1, 10000) i; 
Seq Scan on test (cost-0.00..2625.00 rows=10000 width=32) (actual time=0.014..6.128 rows=10000 loops=1) 
Buffers: shared hit-2500 
Planning: 
Buffers: shared hit=5 
Planning Time: 0.087 ms 
Execution Time: 6.583 ms 
(6 rows) 


=# UPDATE test SET jb = jb || '("bar": "baz"}'; 


=# EXPLAIN (ANALYZE, BUFFERS) SELECT jb->'id' FROM test; 
QUERY PLAN 
Seq Scan on test (cost=0.00..2675.40 rows=10192 width=32) (actual time=0.067..65.511 rows=10000 loops=1) 
Buffers: shared hit=32548 
Planning Time: 0.044 ms 
Execution Time: 66.889 ms What is happened ? Row gets TOASTed ! 


(4 rows) 
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jsonb 
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number of chunks 
100K 


10K j 
raw jsonb size, bytes 


SELECT expr FROM test_toast; 
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number of chunks 
raw jsonb size, bytes 


10K 


SELECT expr FROM test toast; 
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TOAST Explained 


The Oversized-Attribute Storage Technique 


° TOASTed value is pglz compressed 
* Compressed value is splitted into the fixed-size TOAST chunks (1996B for 8KB page) 


* TOAST chunks (along with XE 
generated Oid chunk id and RES RE 
sequnce num ber chun k seq) J PG TOASTer (typstorage = "extended") 
stored in special TOAST " palz compression 


pg toast.pg toast XXX, created J RM | 


for each table containing B 


TOASTable attributes | 


° Attribute in the original heap Heap rel [/ TOAST rel | N 


/ / \ \ 
M4 V Z <7 
1 


tuple is replaced with TOAST 
pointer (18 bytes) containing KEE PD ao | ia] + [omnes | foa] 2] coz | [cia] 3 [aunis 
chunk id, toast relid, — 


raw size, compressed size 


L) 


https://www.postgresql.org/docs/current/storage-toast.html 


TOAST access 


° TOAST pointers does not refer to heap tuples with chunks directly. Instead they 
contains Oid chunk id and we need to descent by index (chunk id, chunk seq). 


Heap relation 


Heap Page x N 


Overhead to read only a few 
Heap Tuple PA: N NG bytes from the first chunk IS 
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TOAST passes 


° Tuple is TOASTed if its size is more than 2KB (1⁄4 of page size). 
° There are 4 TOAST passes. 


° At the each pass considered only attributes of the specific storage type 
(extended/external or main) starting from the largest one. 


° Plain attributes are not TOASTed and not compressed at all. 


° The process can stop at every step, if the resulting tuple size becomes 
less than 2KB. 


° |f the attributes were copied from the other table, they can already be 
compressed or TOASTed. 


° TOASTed attributes are replaced with TOAST pointers. 


n res Pro 


TOAST pass #1 


° Only "extended" and "external" attributes are considered, "extended" 
attributes are compressed. If their size is more than 2KB, they are TOASTed. 


TOAST pass #2 


° Only "extended" and "external" attributes (that were not TOASTed in the 
previous pass) are considered. 


° Each attribute is TOASTed, until the resulting tuple size < 2KB. 


TOAST pass #3 


° Only "main" attributes are considered. 
° Each attribute is compresed, until the resulting tuple size < 2KB. 


pglz compression 


AES cs | 


TOAST pass #4 


° Only "main" attributes are considered. 
* Each attribute is TOASTed, until the resulting tuple size « 2KB. 


C T = eee] 


P | ca 


Motivational example (synthetic test) 


° A table with 100 jsonbs of different sizes (130B-13MB, compressed to 130B-247KB): 
CREATE TABLE test toast AS 
SELECT 


rc 
jsonb build object ( 
'key1', i, 
'key2', (select jsonb agg(0) from 


generate series(1, pow(10, 1+ 5.0 * i / 100.0)::int)),- 
'key3', i, 


'key4', (select jsonb agg(0) from 
generate series(1, pow(10, 0 + 5.0 * i / 100.0) ::int)) -- 


- 10-100k elems 


1-10k elems 
) jb 
FROM generate series(1, 100) i; 

° Each jsonb looks like: Keyl, loooong key2, key3, long key4. 


° We measure execution time of operator -5(jsonb, text) for each row by 
repeating it 1000 times in the query: 


SELECT jb -» 'keyN', jb -» 'keyN', .. jb -» 'keyN' FROM test toast WHERE id = 
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Motivational example (synthetic test) 


Key access time for TOASTed jsonbs linearly increase with jsonb size, 
regardless of key size and position. 


SELECT jb->'keyN' FROM test toast, SELECT jb->'keyN' FROM test toast; 


master master 


key 
eb key1 
@ key2 (120B-12MB) 


key3 


-> execution time, us 
blocks read by -> 


key4 (12B-1.2MB) 


1K 2K 10K 100K 1M 
raw jsonb size, bytes 


TOAST performance problems (synthetic test) 


Key access time for TOASTed jsonbs linearly increase with jsonb size, 
regardless of key size and position. 


SELECT jb->'keyN' FROM test toast; SELECT jb->'keyN' FROM test toast; 


master master 


key1 
key2 (120B-12MB) 


key3 


-> execution time, us 
blocks read by -> 


key4 (12B-1.2MB) 


1K 2K j 10K j 10K 100K 
compressed jsonb size, bytes compressed jsonb size, bytes 


Motivational example (IMDB test) 


° Real-world JSON data extracted from IMDB database (imdb-22-04-2018-json.dump.gz) 


° Typical IMDB «name» document looks like: 
( 


"id": "Connors, Steve (V)", 
"roles": 
"role": "actor", 
"title": "Copperhead Creek (????)" 
J; 
i 
"role": "actor", 
"title": "Ride the Wanted Trail (????)"+ 
} 


], 
"imdb id": 1234567 


* There are many other infrequent fields, but only id, imdb id are mandatory, 
and roles array is the biggest and most frequent (see next slide). 
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IMDB data set field statistics 


size, B 


10K 100K 1M 10 100 1K 10K 100K 1M 10 100 1K 10K 100K 1M 
jsonb compressed jsonb id 


10 100 1K 10K 100K 1M 10 100 1K 10K 100K 1M 10 100 1K 10 100 1K 10K 100K 1M 
aka[] roles[] height 


quotes[] trivia[] 


spouses[] articles[] salaries[] 


real name interviews[] pictorials[] portrayals[] trademarks[] 


fraction, % 


biographies[] 


Aa. —- = "SE 


other works[] cover photos[] date of birth 


"amsami! 


N === 


date of death print biography[] biographical movies[] 


Motivational example (IMDB test) 


SELECT jb -> 'key' FROM imdb.names; raw vs stored jsonb size 


master 


storage type 


inline 
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IK ` 10K 100K 1K 2K 10K 
raw jsonb size, bytes raw jsonb size, bytes 


Motivation 


° Decompression is the biggest problem. Big overhead of decompression 
of the whole jsonb limits the applicability of jsonb as document storage 
with partial access. 

° Need partial decompression 
° Toast introduces additional overhead - read too many block 
° Read only necessary blocks — partial detoast 
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Jsonb deTOAST improvements 


° Partial pglz decompression 
* Decompress only necessary part of jsonb. 


° Sort jsonb object key by their length 
° Move long keys to the end, let short keys stay inlined. 
° Partial deTOASTing using TOAST iterators 
° Detoast only necessary TOAST chunks 
° Inline TOAST 
° Use inline storage to store part of the 14 chunk, let short keys stay inlined 


* Shared TOAST 


* Share non-modified TOAST chunks between versions (partial update), save 
storage size and WAL traffic. 


n res Pro 


Jsonb partial decompression 


° Partial decompression eliminates overhead of pglz decompression of the whole jsonb. 


° Jsonb is decompressed step by step: header, KV entries array, key name and key value. 
Only prefix of jsonb has to be decompressed to acccess a given key ! 


read $ header 
read $ header 


— 
— 


binary search $.k1 


full decompression partial decompression ` 
Pos%gresPro 


Jsonb partial decompression results (synthetic) 


Access to key1 (red) in the prefix of jsonb was significantly improved: 
° For inline compressed jsonb access time becomes constant 
° For jsonb > 1MB acceleration is of order(s) of magnitude. 


SELECT jb->'keyN' FROM test toast; 


*part decomp 
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key2 (120B-12MB) 
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key4 (12B-1.2MB) 


Jsonb partial decompression results (IMDB) 


° Access to the first key <id> and rare key <height> was significantly improved. 


° Access time to big key < » and short «imdb id» remains mostly unchanged 


SELECT jb -> 'key' FROM imdb.names; 


«part decomp 


id 
roles 
height 


imdb_id 


o 
=" 
[o] 

E 
= 
c 
9 
= 
[s 
o 
x 
o 

^ 


1M 100 
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Sorting Jsonb keys by length 


In the original jsonb format object keys are sorted by (length,name), so the short keys with 
longer or alphabetically greater names are placed at the end and cannot benefit from the 


partial decompression. Sorting by length allows fast decompressions of the shortest keys 
(metadata). 


' ' ' u $->'k1', $->'k5' 
read $ header $->'k1', $->'k5 read $ header 
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extract $.k5 value extract $.k5 value 


binary search $.k5 
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original: keys names and values sorted by key names new: keys values sorted by their length 
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JSONB Binary Format (src/include/utils/jsonb.h 


jsonb datum jsonb datum 


j number of KV pairs or array elements container type | number of KV pairs or array elements 
B5 JsonbContainer header 28 bits B | JsonbContainer header Ë (8 bits) (28 bits) 
key1 JEntry KV map: 1=>1, 2=>3, 3 =>2 
key2 JEntry [e] key1 JEntry 2 = object 


keys sorted by 
length, name 


N/T/F lengths interleaved with offsets 
vali JEntry every 32 JEntries 
val2 JEntry Te Mosis AA d s lengths interleaved with offsets 
value: el every 32 JEntries 
val3 JEntry by length val3 JEntry 


root . root É 
container nested container container nested container 


DB JsonbContainer header JsonbContainer header n 
( Hl element1 JEntry E] elementi JEntry 
"aa" H "foo" 5 
"b" : 123.45 element2 JEntry element? JEntry 
"e" o: IN] | t3 JEnt "ua". "feo", m 
[ elemen ntry "b" : 123.45, element3 JEntry 
[ 1.234, true, null, "bar" ] E element4 JEntry CB B element4 JEntry 
[ 


) 1.234, ) 
1.234 true, 1.234 
null, 


"bar" [ 1.234, true, null, "bar" ] 


Reve JEIRY value type: value length or offset (28 bits) 
values sorted 
by key 


value type: value length or offset (28 bits) ' k eys sorted by 4 = array 
key3 JEntry c/s/n/ I length, name [s] key2 JEntry 5 = scalar 


ORIGINAL: VALUES SORTED BY KEYS VALUES SORTED BY THEIR LENGTH 
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Sorting jsonb keys by length results (synthetic) 


Access time to the all short keys and medium-length key4 (excluding long key2, 
placed now at the end of jsonb) was significantly speed up: 


SELECT jb->'keyN' FROM test_toast; 


*part decomp *sorted keys 
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key2 (120B-12MB) 
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Sorting jsonb keys by length results (IMDB) 


° Access to the last short key «imdb. id» now also was speed up. 


° There is a big difference in access time (-5x) between inline and TOASTed values. 


SELECT jb -> 'key' FROM imdb.names; 


«part decomp «sorted keys 


id 
roles 


height 


-> execution time, s 


imdb id 
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Partial deTOASTing 


° We used patch <de-TOAST'ing using a iterator> from the CommitFest. It was originally 
developed by Binguo Bao at GSOC 2019. 


° This patch gives ability to deTOAST and decompress chunk by chunk. So if we need 
only the jsonb header and first keys from the first chunk, only that first chunk will be 
read (actually, some index blocks also will be read). 


° We modified patch adding ability to decompress only the needed prefix of TOAST 
chunks. 
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Partial deTOASTing results (synthetic) 


Partial deTOASTing speeds up only access to the short keys of long jsonbs, making 
access time almost independent of jsonb size. 


SELECT jb->'keyN' FROM test toast, 


*sorted keys *part detoast 
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Partial deTOASTing results (IMDB) 


° Results are the same, but not so noticeable because the are not many big (> 100KB) jsonbs. 


° A big gap in access time (-5x) between inline and TOASTed values is still there. 


SELECT jb -> 'key' FROM imdb.names; 


«sorted keys «part detoast 
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Partial deTOASTing results (IMDB) 


° Effect of partial deTOASTing : Arrow operator (—) for short keys always read only 4 blocks (3 
index and 1 heap). 


+sorted_keys +part_detoast 


o 


id 


roles 


height 


blocks read by -> 


imdb id 


m 


100K _ 1M 100 
raw jsonb size, bytes 


Iniline TOAST 


° Store first TOAST chunk containing jsonb header and possibly some short keys inline in the 
heap tuple. 


* We added new typstorage «tapas», ptb ib D 
which is similar to «extended», ep 
except that it tries to fill the tuple | ERA 
to 2KB (if other attributes occupy palz compression 


A N 
less than 2KB) with the chunk cutted 
from the beginning ah m =... — Ü 


of the compressed data. 


N7 


NA. 27 
Heap Tuple TOAST Tuple 0 TOAST Tuple 1 TOAST Tuple 2 


Inline TOAST results (synthetic) 


Partial inline TOAST completely removes gap in access time to short keys between 
long and mid-size jsonbs. 


SELECT jb->'keyN' FROM test toast, 


*part detoast *inline toast 


key1 
key2 (120B-12MB) 


key3 


-» execution time, us 


key4 (12B-1.2MB) 
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Inline TOAST results (IMDB) 


° Results are the same as in synthetic test. 


° There is some access time gap between compressed and non-compressed jsonbs. 


SELECT jb -> 'key' FROM imdb.names; 
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Inline TOAST results (IMDB) 
Effect of inline TOAST : Arrow operator (—) for short keys read no additional blocks. 
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JSONB partial update 


TOAST was originally designed for atomic data types, it knows nothing 
about internal structure of composite data types like jsonb, hstore, and 
even ordinary arrays. 
TOAST works only with binary BLOBs, it does not try to find differencies 
between old and new values of updated attributes. So, when the 
TOASTed attribute is being updated (does not matter at the beginning or 
at the end and how much data is changed), its chunks are simply fully 
copied. The consequences are: 

e TOAST storage is duplicated 

e WAL traffic is increased in comparison with updates of non-TOASTED 

attributes, because the whole TOASTed values is logged 
e Performance is too low 
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JSONB partial update: The problem 


Example: table with 10K jsonb objects with 1000 keys { "1": 1, "2": 2, ... J. 


CREATE TABLE t AS 
SELECT i AS id, (SELECT jsonb object agg(j, j) FROM generate series(1, 1000) j) js 
FROM generate series(1, 10000) i; 


SELECT oid::regclass AS heap rel, 
pg size pretty(pg- relation size(oid)) AS heap rel size, 
reltoastrelid::regclass AS toast rel, 
pg size  pretty(pg. relation -Size(reltoastrelid)) AS toast rel size 
FROM pg class WHERE relname = 't' 


heap rel 1 heap_rel_size 1 toast rel | toast rel size 


t H 512 kB H pg_toast.pg toast 27227 H 78 MB 
Each 19 KB jsonb is compressed into 6 KB and stored in 4 TOAST chunks. 


SELECT pg column size(js) compressed size, pg column size(js::text::jsonb) orig size from t limit 1; 
compressed size | original size 


6043 | 18904 


SELECT chunk id, count(chunk seq) FROM pg toast.pg toast 47235 GROUP BY chunk id LIMIT 1; 
chunk id 1 count 


57241 | — 4 E Pro 


JSONB partial update: The problem 


First, let's try to update of non-TOASTED int column id: 


SELECT pg current wal lsn(); --> 0/157717F0 
UPDATE t SET id = id + 1; -- 42 ms 
SELECT pg current wal lsn(); --> 0/158E5B48 


SELECT pg_size pretty(pg wal lsn diff('0/158E5B48','0/157717F0')) AS wal size; 
wal_size 


1489 kB (150 bytes per row) 


SELECT oid::regclass AS heap rel, 
pg size pretty(pg relation size(oid)) AS heap rel size, 
reltoastrelid::regclass AS toast rel, 
pg size pretty(pg relation size(reltoastrelid)) AS toast rel size 
FROM pg class 
WHERE relname = 't': 


heap rel | heap rel size | toast rel | toast rel size 
---------- I pe ee ee ee ee ee tee ee eee ee eee 
t | 1024 kB | pg toast.pg toast 47235 | 78 MB 

(was 512 kB) (not changed) 
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JSONB partial update: The problem 


Next, let's try to update of TOASTED jsonb column js: 


SELECT pg current wal lsn(); --5 0/158E5B48 
UPDATE t SET js = js — '1'; -- 12316 ms (was 42 ms, -300x slower) 


SELECT pg current wal lsn(); --5 0/1DB10000 


SELECT pg size pretty(pg wal lsn diff('0/1DB10000','0/158E5B48')) AS wal size; 
wal size 


130 MB (13 KB per row; was 1.5 MB, -87x more) 


SELECT oid::regclass AS heap rel, 
pg size pretty(pg relation size(oid)) AS heap rel size, 
reltoastrelid::regclass AS toast rel, 
pg size pretty(pg relation size(reltoastrelid)) AS toast rel size 
FROM pg class 
WHERE relname = 't'; 


heap rel | heap rel size | toast rel | toast rel size 
---------- ee ee ee eee 
t | 1528 kB | pg toast.pg toast 47235 | 156 MB 

(was 1024 kB) (was 78 MB, 2x more) 
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Partial update using Shared TOAST 


The previous optimizations are great for SELECT, but don't help with UPDATE, since TOAST 
consider jsonb as an atomic binary blob - change part, copy the whole. 


Idea: Keep INLINE short fields (uncompressed) and TOAST pointers to long fields to let update 
short fields without modification of TOAST chunks, which will be shared between versions. 


Currently, this works only for root objects fields, so the longest fields of jsonb object are 
TOASTed until the whole tuple fits into the page (typically, remaining size of jsonb becomes < 
~2000 bytes). 


But this technique can also be applied to array elements or element ranges. We plan to try to 
implement it later, it needs more invasive jsonb API changes. 


Currently, jsonb hook is hardcoded into TOAST pass #1, but in the future it willbecome custom 
datatype TOASTer using pg type.typtoast. 
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Shared TOAST - Jsonb format extensions 


° Added special “TOASTed 
container” JEntry type. 
JsonbContainer header is 
left inline, but the body is 
replaced with a pointer. 


* Added “TOASTed object” 
JsonbContainer type to 
mark object with TOAST 
pointers. 


° TOASTed subcontainers 
are stored as plain jsonb 
datums (varlena header 


added). 


jsonb datum 


varlena header ams 
m JsonbContainer header 


1 = so, object, values sorted by length 
2 = o, object 
3 = to, object containing TOASTED fields 


keys sorted by 
length, name 


4 =a, array 
5 = s, scalar 


' 

' 

' 

' 

' 

' T 

' bz Pe 

' Nag 

S 

' N len | value type: 
: values sorted m 

by k `. ! |c/tc/s/n/ 
`. RENTE 
' ` 


root TOASTED nested container [ 1.234, true, null, "bar" ] 
container 


value length or offset (28 bits) 


lengths interleaved with offsets 
every 32 JEntries 


varlena header 


EN RN 
D eee — 

| 
i 


1.234 j 


- , 
JsonbContainer header 


TOAST pointer 


Shared TOAST - tuple structure 


In-Memory Tuple 


` In this example two DINEM — 


longest fields of jsonb are T | 
TOASTed separately | 
two TOAST pointers nis pglz compression pglz compression 
« Operators like -» can C] 
` | 


simply return TOAST | rel | TOAST rel 
pointer as external 


datum, accessing only the PL oo. EDEZN 


inline part of jsonb 


Shared TOAST - update 


Heap Tuple 1 


e When the short inline Mm -- a E - 


jsonb 


field is updated, only the DTI] EDS E 


new version of inline data 
is created. 
Jl 


Heap Tuple 2 


* When some part of long 
the long field is updated, 
the whole container is 
copied, updated and then 
TOASTed back with new 
oid (in the future oids can 
be shared). | | HeapTuple3 — | 


* Unchanged TOASTed u= 


fields are always shared. 


copy, replace $.c.x 


pointer 1 


Shared TOAST - access results (synthetic) 


Gap in access time to short keys is completely removed. Access to mid-size fields is 
slow down, because they are TOASTed instead of stored inline (we need to fix this). 


SELECT jb->'keyN' FROM test toast, 


+inline toast *shared toast 


key1 


pa 
° 
o 


key2 (120B-12MB) 


key3 


-> execution time, Us 


key4 (12B-1.2MB) 


pa 
° 


10M I 1K 
raw jsonb size, bytes 
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Shared TOAST - access results (IMDB) 


° Results are the same as in synthetic test. 
° All short keys is speed up as much as possible. 


SELECT jb -> 'key' FROM imdb.names; 


*inline toast *shared toast 


id 
roles 


height 


v 
m 
a 
E 
= 
= 
S 
2 
3 
o 
w 
x 
w 
A 
' 


imdb id 


1M. 100 
raw jsonb size, bytes 


Shared TOAST - update results (synthetic) 


Update time of short keys does not depend on total jsonb size 
Update time of TOASTed fields depends only on their own size 


UPDATE test toast SET jb = jsonb_set(jb, (keyN,0), ?): 


master shared toast 


key1 
key2 (120B-12MB) 
key3 


key4 (12B-1.2MB) 


query execution time, ms 
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Shared TOAST - update results (synthetic) 


° WALtraffic due to update of short and mid-size keys is greatly decreased 


[^7] 
Q 
= 
á 
5 
D 
2 
Œ 
— 
° 
É 
° 
D 
af 
š 


UPDATE test toast SET jb = jsonb set(jb, (keyN,0), ?); 


master 


100K . 100 | 
compressed jsonb size, bytes 


shared toast 


1K 2K 


key1 
key2 (120B-12MB) 
key3 


key4 (12B-1.2MB) 


@ key2 (120B-12MB) 
key4 (12B-1.2MB) 


*shared toast 
18K10K 100K 1M 10M 


) 


tic 


+inline_toast 
1K 10K 100K 1M 10M 


synthe 


| 


18K10K 100K 1M 10M 


ys 
raw jsonb size, bytes 


*sorted ke 
18K10K 100K 1M 10M 


H 
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2 
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O 
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O 
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LU 
V) 


step results 


1RK10K 100K 1M 10M 


Step 


12K10K 100K 1M 10M 


srl ‘aw uonnoaxe <- 
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= 
c 
9 
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= 
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Step-by-step results (IMDB 


1K 


+part_decomp 


10K 100K 1M 100 


10K 100K 


SELECT jb -> 'key' FROM imdb.names; 


+sorted_keys 


1K 


10K 100K 1M 100 1K 
raw jsonb size, bytes 


+part_detoast 


10K 100K 1M 100 


*inline toast 


1K 


10K 100K 1M 100 


1K 


*shared toast 


10K 100K 1M 


id 
roles 
height 


imdb id 


Popular mistake: CREATE TABLE qq (jsonb) 
(id, {...}::jsonb) Mga ({id,...}::jsonb) 


SELECT expr FROM test_toast; 


number of chunks 
100 1 


+shared_toast 


o 
= 
w 
É 
= 
= 
= 
D 
= 
po | 
o 
° 
x 
w 
c 
2 
o 
[77] 
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10M 100 1K 2K 
raw jsonb size, bytes 


blocks read by expr 


Popular mistake: CREATE TABLE qq (jsonb) 
(id, f...)::jsonb)NAI(fid,...)::jsonb) 


SELECT expr FROM test toast, 


number of chunks 
10 100 1 


*shared toast 


jb-»»'id' 


e jb->'a'->>0 


| ` e 
Le $e?» VP? oh O° MO = ° ot 


' 
10M 100 1K 2K 
raw jsonb size, bytes 


Appendable bytea: Motivational example 


° A table with 100 MB bytea (uncompressed): 


CREATE TABLE test (data bytea); 
ALTER TABLE test ALTER COLUMN data SET STORAGE EXTERNAL; 
INSERT INTO test SELECT repeat('a', 100000000)::bytea data; 


° Append 1 byte to bytea: 


EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) 
UPDATE test SET data = data || 'x'::bytea; 


Update on test (actual time=1359.229..1359.232 rows-0 loops-1) 
Buffers: shared hit=238260 read=12663 dirtied=25189 written=33840 
-> Seq Scan on test (actual time=155.499..166.509 rows=1 loops=1) 

Buffers: shared hit=12665 

Planning Time: 0.127 ms 

Execution Time: 1382.959 ms 


>1 second to append 1 byte !!! 
Table size doubled to 200 MB, 100 MB of WAL generated. 


e Thanks to Alexander ? who raised the problem of (non-effective) streaming into 
bytea at PGConf.Online ! è 
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Motivational example (explanation) 


° Current TOAST is Tum 4 DEN 
not sufficient for aas 
partial updates | 


append|1.5 KB 
° All data is 
deTOASTed befo re In-memory Tuple 2 
I copy first 2 chunks 
in-memory ei append to last chunk 
ere ë add 1 new chunk 
modification 


* Updated data is 
TOASTed back after 
modification with |S 


° naga = ' | 
new TOAST TE > ET 5] ECT =) Er] =) erl] 
' ( (2 KB) (2KB) (1KB) ! 


= 


Appendable bytea: Solution 


° Special datum : 
format: TOAST [Toasts 
° " " " paga pokes. i chunk 0 i chunk 1 i chun 
pointer + inline data MGA EFL > _ 


° Inline data serves as een KB 


a buffer for TOASTing SZ 
° Operator | | does not "nz inline chunk 


deTOAST data, it nme-iskg| (569 
appends inline data 

producing datum in 

the new format 


Appendable bytea: Solution 


° When size of inline — NE: | 
abo dili in ' T TOAST Tuple 1 TOAST Tuple 2 ' TOAST Tuple 3 i 

data exceeds 2 KB TOAST pointer | . ) "n A a A ' 

9 ize =7 KB inline chunk ' Ng chunk 0 oid 1 chunk 1 = oid 2 chunk 2 ' 

TOASTer recognizes Mia SR Dc s 


i mE EINE AA CF ee orea ] 
changes in old and mu =: u 

new datums and ; 
TOASTs only the new prenant 
inline data with the 


same TOAST oid 


* Last not filled chunk N x E | : 
can be rewritten with : E | TOAST Tuple 4 ! TOAST Tuple 5 


MEE! 
creation of new tuple [esta 


version 


2 KB bound 


TOAST pointer cu pret Se: new version of 
size = 9 KB (3.5 KB) SS chunk 2 
inline = 3.5 KB i ' ba? 
I x ^ A 
I `. 
I `. 


TOAST 


* First unmodified 
chunks (0,1) are 


bna Benefit: 7 chunks vs 14 (master) 


Results - motivational example 


° Append 1 byte to bytea: 


EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) 
UPDATE test SET data = data || 'x'::bytea; 


* Update on test (actual time=0.060..0.061 rows=0 loops=1) 
Buffers: shared hit=2 (was 12665) 
-> Seq Scan on test (actual time=0.017..0.020 rows-1 loops=1) 
Buffers: shared hit-1 
Planning Time: 0.727 ms 
Execution Time: 0.496 ms (was 1382 ms) 


2750x speed up! 


° Table size remains 100 MB 
° Only 143 bytes of WAL generated (was 100 MB) 
° No unnecessary buffer reads and writes 
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exec time, ms 


Appendable bytea: append to bytea (time) 


UPDATE test SET data = data || repeat('a', append size)::bytea, 


master 


. Lad 
KONG AG "04 eunte t IPA Ka aata PA SUD mee Ps ep 


100K 


OC OLD NEW 


apo 


^ 


1M 10 
data size (old size), bytes 


append toast 


c 
F «^ a p estote *00 KAN Y ae o” "e" o asa R "e “° CO a” ot 


° oo "a ee te 


E °°. e 5 v 2, pa š T ° 
Vue ma Tia AA p Mete CA Ti" ort as, 


100 


OC 


APPEND SIZE 


append size, bytes 
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Appendable bytea: append to bytea (WAL) 


UPDATE test SET data = data || repeat('a', append size)::bytea, 


master append_toast 


* 
[M 
P en rnnt e". š 
- P P ae ooo anf ost a en 9 moo ip... oh 0599 209 oia ag ood * 00%, NEIT DAAI A 


ITAPAT ESENG NL LOS AI 


append size, bytes 
@ 10 


WAL size, bytes 
5 
A 


a 
MC peco t Poo €" oo "o, 00 "afp e eect a °! 


2 èu ° k usc ^p^ e? itecto, eo r dnd eb 


100K 1M 10 
data size (old size), bytes 


OC OLD #NEW OC INLINED 66B- NEW à 
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Appendable bytea: stream 


Stream organized as follows: 
° 1 row (id, bytea) grows from 0 up to 1Mb 


UPDATE test SET data = data || repeat('a', append size)::bytea WHERE id = 0; COMMIT, 


° append size = 10b, 100b,...,100Kb 
* pg stat statements: time, blocks r/rw, wal 
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Appendable bytea: stream (time) 


UPDATE test SET data = data || repeat('a', append size)::bytea, 


number of chunks 
100 1 


append toast 


append size, bytes 
10 
100 


1K 


exec time, ms 


10K 100K 1M 10 10K 
data size (old size), bytes 


Appendable bytea: stream (WAL 


UPDATE test SET data = data || repeat('a', append size)::bytea, 


number of chunks 
1 


append_toast 


append size, bytes 
10 


WAL size, bytes 


1M 10 
qata size (old size), bytes 


Appendable bytea: stream (througput MB/s 


UPDATE test SET data = data || repeat('a', append size)::bytea, 


append toast 


100 MB/s} 


10 MB/s + 


append size, bytes 


100 KB/s} 10 


append speed, MB/s 


100 B/s 1 


1MB | 0B 
data size, bytes 


Non-scientific comparison PG vs Mongo [ans] 


* Segscan, PG - in-memory, Mongo (4.4.4): 16Gb (in-memory), 4GB (1/2) 


SELECT count(*) FROM imdb.names WHERE jb ->> 'id' = 'Cabiria"; 
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w 
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Summary and references 


* We demonstrated step-by-step performance improvements (with 
backward compatibility), which lead to significant (10X) speedup for 
SELECTs and much cheaper UPDATEs (OLTP Jsonb?) 

° Github: https://github.com/postgrespro/postgres/tree/jsonb shared toast 
e Slides of this talk - http://www.sai.msu.su/-megera/postgres/talks/jsonb-highload-2021.pdf 


* The same technique can be applied to any data types with random 
access to parts of data (arrays, hstore, movie, pdf ...) 


* Appendable bytea: 1000X performance improvements 
* https://github.com/postgrespro/postgres/tree/bytea appendable toast 


* Jsonb is ubiquitous and is constantly developing 
° JSON[B] Roadmap V2, Postgres Professional Webinar, Sep 17, 2020 
° JSON[B] Roadmap V3, Postgres Build 2020, Dec 8, 2020 | 
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TODO (OLTP JSONB, OLAP JSONB) 


e More benchmarks (YCSB, use cases), PG vs Mongo 

* Extend shared TOAST to support strings,arrays, jsonb arrays 
* Prepend, truncate, insert, delete for appendable bytea 

* Pluggable TOAST 

* How to integrate this new stuff into the CORE ? 


e WiredElephant - storage (non-TOASTED) for tree-like structures with big 
attributes ? 
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Ham HyxKHbI Bawn KeMcbil (TecTOBbIe naHHbie n 3anpocbi) |! 


Contact obartunov@postgrespro.ru, n.gluhov@postgrespro.ru for collaboration 
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